ggml: aarch64: implement SVE kernels for q8_0_q8_0, q4_0_q8_0 vector dot #7433
Conversation
ggerganov left a comment:
Could you demonstrate that short perplexity runs produce reasonable values compared to no-SVE?
Thanks for the comment! I ran perplexity with SVE and no-SVE. The following are the commands and partial logs.

### Q8_0 / no-SVE
$ ./build-neon/bin/perplexity -s 0 -np 1 -t 32 -m llama-2-7b-chat.Q8_0.gguf -f wikitext-2-raw/wiki.test.raw -c 128 -b 128 --chunks 4
---
(...snip)
system_info: n_threads = 32 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 906.69 ms
perplexity: calculating perplexity over 4 chunks, n_ctx=128, batch_size=128, n_seq=1
perplexity: 2.47 seconds per pass - ETA 0.15 minutes
[1]5.2130,[2]7.4447,[3]7.4725,[4]8.4178,
Final estimate: PPL = 8.4178 +/- 1.61226
llama_print_timings: load time = 314.22 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 9876.98 ms / 512 tokens ( 19.29 ms per token, 51.84 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 10796.42 ms / 513 tokens
### Q8_0 / SVE
$ ./build-sve/bin/perplexity -s 0 -np 1 -t 32 -m llama-2-7b-chat.Q8_0.gguf -f wikitext-2-raw/wiki.test.raw -c 128 -b 128 --chunks 4
---
(...snip)
system_info: n_threads = 32 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 1 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 915.193 ms
perplexity: calculating perplexity over 4 chunks, n_ctx=128, batch_size=128, n_seq=1
perplexity: 0.99 seconds per pass - ETA 0.05 minutes
[1]5.2291,[2]7.4493,[3]7.4706,[4]8.4219,
Final estimate: PPL = 8.4219 +/- 1.61261
llama_print_timings: load time = 304.68 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 3940.02 ms / 512 tokens ( 7.70 ms per token, 129.95 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 4868.40 ms / 513 tokens
### Q4_0 / no-SVE
$ ./build-neon/bin/perplexity -s 0 -np 1 -t 32 -m llama-2-7b-chat.Q4_0.gguf -f wikitext-2-raw/wiki.test.raw -c 128 -b 128 --chunks 4
---
(...snip)
system_info: n_threads = 32 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 898.157 ms
perplexity: calculating perplexity over 4 chunks, n_ctx=128, batch_size=128, n_seq=1
perplexity: 2.53 seconds per pass - ETA 0.17 minutes
[1]5.4426,[2]7.4845,[3]7.9395,[4]9.0525,
Final estimate: PPL = 9.0525 +/- 1.80378
llama_print_timings: load time = 13751.66 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 10110.36 ms / 512 tokens ( 19.75 ms per token, 50.64 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 11021.03 ms / 513 tokens
### Q4_0 / SVE
$ ./build-sve/bin/perplexity -s 0 -np 1 -t 32 -m llama-2-7b-chat.Q4_0.gguf -f wikitext-2-raw/wiki.test.raw -c 128 -b 128 --chunks 4
---
(...snip)
system_info: n_threads = 32 / 64 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 1 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 901.443 ms
perplexity: calculating perplexity over 4 chunks, n_ctx=128, batch_size=128, n_seq=1
perplexity: 1.09 seconds per pass - ETA 0.07 minutes
[1]5.4306,[2]7.4762,[3]7.9293,[4]9.0456,
Final estimate: PPL = 9.0456 +/- 1.80407
llama_print_timings: load time = 184.21 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 4340.33 ms / 512 tokens ( 8.48 ms per token, 117.96 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 5254.53 ms / 513 tokens

And below is a summary:

Q8_0: PPL = 8.4178 (no-SVE) vs 8.4219 (SVE)
Q4_0: PPL = 9.0525 (no-SVE) vs 9.0456 (SVE)

This change does not appear to have any impact on accuracy.
Thanks. I checked Azure Cloud to see if I can rent a node that supports Arm SVE, and it seems there will soon be VMs available: https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/general-purpose/dpsv6-series?tabs=sizebasic
I don't understand why, but after this PR I was having build issues on one of my machines when using (...)
@ggerganov That's great. Thank you for sharing the information. If there is anything I can do to help with CI/CD for the SVE implementation, I would like to contribute!
This PR introduces support for SVE (Scalable Vector Extension) kernels for the q8_0_q8_0 and q4_0_q8_0 vector dot products on the Arm architecture. A similar proposal for SVE support was made in PR #5780, but that one also includes changes to the block layout.
This PR implements the SVE vector dot products with minimal changes as a first step toward SVE support. The performance gain is smaller than that of PR #5780, but it is ~1.1x to 1.5x faster than the original implementation.
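For illustration, the core idea can be sketched with SVE intrinsics as below. This is a simplified sketch, not the exact code in this PR: the struct and function names are illustrative stand-ins for ggml's block_q8_0 and vector dot routines, and a 256-bit vector length (svcntb() == 32) is assumed so that a single load covers a whole block.

#include <arm_sve.h>

#define QK8_0 32                 // int8 quants per block (as in ggml)

typedef struct {
    __fp16 d;                    // per-block scale (stand-in for ggml_fp16_t)
    int8_t qs[QK8_0];            // quantized values
} block_q8_0;

// Sketch: with a 256-bit SVE vector, one load covers a whole 32-byte block.
static float vec_dot_q8_0_q8_0_sve(int n, const block_q8_0 *x, const block_q8_0 *y) {
    const int nb = n / QK8_0;
    svfloat32_t sumv = svdup_n_f32(0.0f);

    for (int i = 0; i < nb; ++i) {
        // Load the 32 int8 quants of each block with a single vector load.
        const svint8_t qx = svld1_s8(svptrue_b8(), x[i].qs);
        const svint8_t qy = svld1_s8(svptrue_b8(), y[i].qs);

        // SDOT: multiply int8 lanes pairwise and accumulate groups of
        // four products into int32 lanes.
        const svint32_t dot = svdot_s32(svdup_n_s32(0), qx, qy);

        // Scale the integer partial sums by the product of the block scales.
        const float d = (float)x[i].d * (float)y[i].d;
        sumv = svmla_n_f32_x(svptrue_b32(), sumv,
                             svcvt_f32_s32_x(svptrue_b32(), dot), d);
    }

    // Horizontal reduction across the vector lanes.
    return svaddv_f32(svptrue_b32(), sumv);
}

The same pattern applies to q4_0_q8_0, with an extra step to unpack the 4-bit quants to int8 (and subtract the offset of 8) before the dot product.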
SVE is enabled if LLAMA_SVE=ON is set in CMake. Here is an example of the compilation commands (a minimal sketch; the build directory name is illustrative):
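$ cmake -B build-sve -DLLAMA_SVE=ON
$ cmake --build build-sve --config Release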
Here are the performance results measured on AWS Graviton3E (hpc7g).

### Q4_0_Q8_0
[chart: decoding throughput, tokens/sec]

### Q8_0_Q8_0
[chart: decoding throughput, tokens/sec]
Limitation: this pull request only supports a 256-bit SVE vector length.
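Since the SVE vector length is only known at runtime, a dispatch along the following lines (a sketch; the actual integration point is inside ggml's vector dot functions) keeps other vector lengths on the existing NEON path:

if (svcntb() == QK8_0) {
    // 256-bit SVE: one vector holds an entire 32-byte q8_0 block -> SVE kernel
} else {
    // other vector lengths (128-bit, 512-bit, ...) -> existing NEON implementation
}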